BONN: Bayesian Optimized Binary Neural Network


Algorithm 5 Optimizing 1-bit CNNs with Bayesian Learning

Input: The full-precision kernels $k$, the reconstruction vector $w$, the learning rate $\eta$, the regularization parameters $\lambda$, $\theta$ and variance $\nu$, and the training dataset.
Output: The BONN with the updated $k$, $w$, $\mu$, $\sigma$, $c_m$, $\sigma_m$.
1: Initialize $k$ and $w$ randomly, and then estimate $\mu$, $\sigma$ based on the average and variance of $k$, respectively;
2: repeat
3:   // Forward propagation
4:   for $l = 1$ to $L$ do
5:     $\hat{k}^l_i = \bar{w}^l \mathrm{sign}(k^l_i), \forall i$; // Each element of $w^l$ is replaced by the average of all its elements, $\bar{w}^l$.
6:     Perform activation binarization; // using the sign function
7:     Perform 2D convolution with $\hat{k}^l_i, \forall i$;
8:   end for
9:   // Backward propagation
10:  Compute $\delta_{\hat{k}^l_i} = \partial L_S / \partial \hat{k}^l_i, \forall l, i$;
11:  for $l = L$ to $1$ do
12:    Calculate $\delta_{k^l_i}$, $\delta_{w^l}$, $\delta_{\mu^l_i}$, $\delta_{\sigma^l_i}$; // using Eqs. 3.112-3.119
13:    Update parameters $k^l_i$, $w^l$, $\mu^l_i$, $\sigma^l_i$ using SGD;
14:  end for
15:  Update $c_m$, $\sigma_m$;
16: until convergence

where $w$ denotes a learned vector used to reconstruct the full-precision kernels and is shared within a layer. As mentioned in Section 3.2, during forward propagation $w^l$ becomes a scalar $\bar{w}^l$ in each layer, where $\bar{w}^l$ is the mean of $w^l$ and is calculated online. The convolution process is represented as

$O^{l+1} = (\bar{w}^l \hat{K}^l) \circledast \hat{O}^l = \bar{w}^l (\hat{K}^l \circledast \hat{O}^l)$,   (3.111)

where $\hat{O}^l$ denotes the binarized feature map of the $l$-th layer, and $O^{l+1}$ is the feature map of the $(l+1)$-th layer. As Eq. 3.111 shows, the actual convolution is still binary, and $O^{l+1}$ is obtained by simply multiplying the scalar $\bar{w}^l$ with the result of the binary convolution. For each layer, only one floating-point multiplication is added, which is negligible for BONNs.
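
To make the computation in Eq. 3.111 concrete, the following PyTorch-style sketch binarizes the activations and kernels with the sign function, runs the binary convolution, and applies the single floating-point multiplication by $\bar{w}^l$. It is an illustration only, not the authors' implementation; the function name, tensor shapes, stride, and padding are assumptions.

import torch
import torch.nn.functional as F

def bonn_forward_conv(o_l, k_l, w_l, stride=1, padding=1):
    # o_l: input feature map of layer l, shape (N, C_in, H, W)
    # k_l: full-precision kernels of layer l, shape (C_out, C_in, kH, kW)
    # w_l: reconstruction vector of layer l (hypothetical 1-D tensor)
    w_bar = w_l.mean()            # scalar mean of w^l, computed online
    o_hat = torch.sign(o_l)       # binarized feature map \hat{O}^l
    k_hat = torch.sign(k_l)       # binarized kernels sign(k^l_i)
    # binary convolution, then one floating-point multiplication per layer (Eq. 3.111)
    return w_bar * F.conv2d(o_hat, k_hat, stride=stride, padding=padding)

In a deployed 1-bit network the binary convolution would be realized with XNOR and popcount operations; F.conv2d is used here only to keep the sketch runnable.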

In addition, we consider the Gaussian distribution in the forward process of Bayesian pruning, which updates every filter in a group based on the group mean. Specifically, during pruning we replace each filter as $K^l_{i,j} \leftarrow (1-\gamma)K^l_{i,j} + \gamma \bar{K}^l_j$, where $\bar{K}^l_j$ denotes the mean filter of the $j$-th group.
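
A minimal sketch of this pruning-time update follows, assuming the filters of one group are stacked along the first tensor dimension and that $\gamma$ is a given blending coefficient (both assumptions for illustration).

import torch

def blend_group_filters(k_group, gamma):
    # k_group: filters of one group, shape (G, C_in, kH, kW)
    # gamma:   blending coefficient toward the group mean
    k_mean = k_group.mean(dim=0, keepdim=True)       # group mean filter
    return (1.0 - gamma) * k_group + gamma * k_mean  # K <- (1 - gamma) * K + gamma * mean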

3.7.6 Asynchronous Backward Propagation

To minimize Eq. 3.108, we update $k^{l,i}_n$, $w^l$, $\mu^l_i$, $\sigma^l_i$, $c_m$, and $\sigma_m$ using stochastic gradient descent (SGD) in an asynchronous manner, which updates $w$ instead of $\bar{w}$, as elaborated below.
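
Before the detailed derivations, the following sketch (hypothetical shapes and a placeholder loss, purely illustrative) shows the asynchronous behaviour: the forward pass only uses the scalar mean $\bar{w}^l$, yet the SGD step updates every element of the full vector $w^l$.

import torch

w = torch.randn(64, requires_grad=True)        # reconstruction vector w^l of one layer
k_hat = torch.sign(torch.randn(64, 3, 3, 3))   # binarized kernels (constants here)

w_bar = w.mean()                               # scalar \bar{w}^l used in the forward pass
loss = (w_bar * k_hat).pow(2).mean()           # placeholder loss, for illustration only
loss.backward()                                # the gradient reaches every entry of w
with torch.no_grad():
    w -= 0.01 * w.grad                         # SGD updates w, not \bar{w}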

Updating $k^{l,i}_n$: We define $\delta_{k^{l,i}_n}$ as the gradient of the full-precision kernel $k^{l,i}_n$, and we have:

$\delta_{k^{l,i}_n} = \frac{\partial L}{\partial k^{l,i}_n} = \frac{\partial L_S}{\partial k^{l,i}_n} + \frac{\partial L_B}{\partial k^{l,i}_n}$.   (3.112)
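
Eq. 3.112 simply accumulates the gradients contributed by the two loss terms $L_S$ and $L_B$ before the SGD step. A minimal sketch, assuming both partial gradients have already been computed elsewhere and using an illustrative learning rate:

import torch

def sgd_step_kernel(k, grad_ls, grad_lb, lr=0.01):
    # k:       full-precision kernel k^{l,i}_n
    # grad_ls: partial gradient dL_S / dk (assumed precomputed)
    # grad_lb: partial gradient dL_B / dk (assumed precomputed)
    delta_k = grad_ls + grad_lb   # Eq. 3.112: total gradient of the full-precision kernel
    return k - lr * delta_k       # SGD update with learning rate eta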